Declaration Under 37 C.F.R. § 1.132 of En Li, Ph.D.

Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 



I, the undersigned, En Li, Ph.D., residing at 45 Hinckley Road, 
Newton, Massachusetts, 02168, declare and state as follows: 

1. I am a co-inventor of the above-captioned patent application.

2. I am currently employed by Novartis Institutes for Biomedical Research as 
Vice President & Global Head, Animal Models of Disease and Epigenetics Program. Prior 
to my current employment, I was an Associate Professor of Medicine at Harvard Medical 
School and directed a laboratory in the Cardiovascular Research Center at the Massachusetts 
General Hospital from January 1993 to April 2003, where I conducted and supervised 
research in the field of mouse genetics and developmental biology. 

3. A current curriculum vitae is appended hereto as EXHIBIT A.

4. I have reviewed the above-captioned patent application and the Office Action
dated June 6, 2005. I have also reviewed the sequence listing as filed and the sequence
listing as amended on July 23, 2001. I have also reviewed the claims of the captioned patent
application.

5. I have been informed that the Examiner has not granted priority to the earlier
filed patent applications because there is insufficient proof that the coding regions of
currently amended SEQ ID NOS:1 and 2 are the same as those listed in the priority
documents, viz., the mouse Dnmt3a and Dnmt3b cDNA clones encoding the coding regions
of SEQ ID NOS:1 and 2, respectively, that were deposited with the American Type Culture
Collection (ATCC), 10801 University Boulevard, Manassas, Virginia 20110-2209, USA.
Sequences harboring the coding regions of SEQ ID NOS:1 and 2, respectively, were
deposited with the ATCC on June 16, 1998, and assigned ATCC Deposit Nos. 209933 and
209934, respectively. The deposit date of June 16, 1998 was prior to the filing date of the
first provisional application, App. No. 60/090,906, filed June 25, 1998, the benefit of which
is claimed. The '906 application includes the sequence information and references the
deposits of the sequenced material on page 15, line 26, through page 16, line 2, of the
specification.

6. In November 2004, Applicants had samples withdrawn of the mouse Dnmt3a
and Dnmt3b cDNA clones contained within ATCC Deposit Nos. 209933 and 209934,
respectively. At the Applicants' request, Kenneth D. Bloch, M.D., a faculty member in the
Cardiovascular Research Center at the Massachusetts General Hospital and an experienced
DNA sequencer, sequenced nucleotides that spanned the coding regions of mouse Dnmt3a
and Dnmt3b in the deposited cDNA. A nucleotide alignment that spans the coding regions
of sequenced mouse Dnmt3a cDNA clone contained in ATCC Deposit No. 209933 and
currently amended SEQ ID NO:1 is shown in EXHIBIT B. A nucleotide alignment that spans
the coding regions of sequenced mouse Dnmt3b cDNA clone contained in ATCC Deposit
No. 209934 and currently amended SEQ ID NO:2 is shown in EXHIBIT C.

7. The amendment to the sequence listing, which was filed on July 23, 2001,
corrected six nucleotides in the coding sequence of SEQ ID NO:1 (see the bolded
nucleotides at positions 516, 843, 1036, 1110, 1116 and 1726 in EXHIBIT B) and two
nucleotides in the coding sequence of SEQ ID NO:2 (see the bolded nucleotides at positions
918 and 920 in EXHIBIT C).

8. The deposited clones recited in ¶¶ 5 and 6, above (i.e., ATCC Deposit Nos.
209933 and 209934) are the same as the deposited clones recited in the above-captioned
application. The coding sequence of ATCC Deposit No. 209933 is currently believed to be
the same as the coding sequence of currently amended SEQ ID NO:1. The coding sequence
of ATCC Deposit No. 209934 is currently believed to be the same as the coding sequence of
currently amended SEQ ID NO:2.

9. It is well known that sequencing errors are a common problem in Molecular
Biology. See, e.g., Peter Richterich, Estimation of Errors in "Raw" DNA Sequences: A
Validation Study, 8 Genome Research 251-59 (1998) (EXHIBIT D). I believe that one skilled
in the art would have sequenced the deposited material and recognized the sequencing errors
in the coding region. I believe that the correct mouse Dnmt3a and Dnmt3b coding
sequences are inherent to the ATCC deposited clones, ATCC Deposit Nos. 209933 and
209934, respectively, which were deposited prior to the filing of App. No. 60/090,906, filed
June 25, 1998, the benefit of which is claimed.

10. Accordingly, based on the above, I believe that Applicants are entitled to the
June 25, 1998 filing date for the coding sequences of mouse Dnmt3a and Dnmt3b contained
within ATCC Deposit Nos. 209933 and 209934, respectively.

11. I hereby declare that all statements made herein of my own knowledge are
true and that all statements made on information and belief are believed to be true; and
further that these statements were made with the knowledge that willful false statements and
the like so made are punishable by fine or imprisonment, or both, under Section 1001 of
Title 18 of the United States Code and that such willful false statements may jeopardize the
validity of the present patent application or any patent issued thereon.

Respectfully submitted, 




En Li, Ph.D. 



Date:
En Li, Ph.D.

Vice President & Global Head 
Animal Models of Disease 
Novartis Institute for Biomedical Research, Inc. 
250 Mass Ave. Cambridge, MA 02139 
Tel: 617-871-7072 
Fax: 617-871-7263
Email: en.li@novartis.com



Education: 

1984 B.Sc. in Biochemistry. Peking University, Beijing, China 

1992 Ph.D. Massachusetts Institute of Technology (Biology) 

(Advisor: Prof. Rudolf Jaenisch) 

Professional Experience: 

1993-1996 Principal Investigator, Cardiovascular Research Center, Massachusetts General Hospital
          Instructor, Department of Medicine, Harvard Medical School
1996-2000 Assistant Professor, Department of Medicine, Harvard Medical School
2000-2003 Associate Professor, Department of Medicine, Harvard Medical School
2001-2003 Guest Professor, Beijing University, Health Science Center
2003-     VP & Global Head, Models of Disease Center, Epigenetics Program,
          Novartis Institute for Biomedical Research



Review and editorial board 



1999- External Grant Reviewer, Human Frontier Science Program 

1999- External Grant Reviewer, NIH, NICHD

1999- Member of the Advisory Board, Journal of Biochemistry

2000- External Grant Reviewer, NIH, NIA 
2000- External Grant Reviewer, NSF 
2000- Mail Reviewer, Wellcome Trust 

2003- Member of the Advisory Board. China Science Reports 

2004 External Grant Reviewer, Chinese Natural Science Foundation. 

1993- Ad Hoc Reviewer for the following journals



Nature, Science, Cell, Nat Genet, Genes Dev, Trends Genet, Development, Mol Cell 
Biol, PNAS, Human Mol. Genet., Dev. Biol., Mech. Dev., J. Cell Biol., Dev. Dyn., 
Gene, Genomics, Nucl Acid Res, Mammalian Genome, etc. 



Professional Societies 

1990- American Association for the Advancement of Science, Member

1994- DNA Methylation Society, Member

1999- Ray Wu Society, Member

Invited Presentations (Since 2003)

2003 Invited speaker, Gordon Research Conference - Cancer Genetics and Epigenetics 

2003 Session chair, Keystone Symposium - Chromatin (Big Sky, Montana) 

2003 Invited speaker, Annual HUGO meeting at Cancun, Mexico 

2003 Invited speaker, Gordon Research Conference - Epigenetics 

2004 Invited speaker, the 2nd Annual CDB symposium, Kobe, Japan

2004 Invited speaker, 2nd Weissenburg Symposium on DNA methylation - an important genetic
signal. Weissenburg, Germany

2004 Session Chair, on genomic imprinting. 10th SCBA International Symposium, Beijing, China

2004 Invited speaker, Genomic imprinting workshop in Montpellier, France 

2005 Vice Chair, Gordon Research Conference 'Cancer Genetics and Epigenetics' 
Exhibit B

Alignment spanning the coding region of mouse Dnmt3a sequence from ATCC 
Deposit No. 209933 (top) and currently amended SEQ ID NO:1 (bottom) 1

atgccctccagcggccccggggacaccagcagctcctctctggagcgggaggatgatcga
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
atgccctccagcggccccggggacaccagcagctcctctctggagcgggaggatgatcga 276

aaggaaggagaggaacaggaggagaaccgtggcaaggaagagcgccaggagcccagcgcc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aaggaaggagaggaacaggaggagaaccgtggcaaggaagagcgccaggagcccagcgcc 336

acggcccggaaggtggggaggcctggccggaagcgcaagcacccaccggtggaaagcagt
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
acggcccggaaggtggggaggcctggccggaagcgcaagcacccaccggtggaaagcagt 396

gacacccccaaggacccagcagtgaccaccaagtctcagcccatggcccaggactctggc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gacacccccaaggacccagcagtgaccaccaagtctcagcccatggcccaggactctggc 456

ccctcagatctgctacccaatggagacttggagaagcggagtgaaccccaacctgaggag
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ccctcagatctgctacccaatggagacttggagaagcggagtgaaccccaacctgaggag 516

gggagcccagctgcagggcagaagggtggggccccagctgaaggagagggaactgagacc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gggagcccagctgcagggcagaagggtggggccccagctgaaggagagggaactgagacc 576

ccaccagaagcctccagagctgtggagaatggctgctgtgtgaccaaggaaggccgtgga
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ccaccagaagcctccagagctgtggagaatggctgctgtgtgaccaaggaaggccgtgga 636

gcctctgcaggagagggcaaagaacagaagcagaccaacatcgaatccatgaaaatggag
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gcctctgcaggagagggcaaagaacagaagcagaccaacatcgaatccatgaaaatggag 696

ggctcccggggccgactgcgaggtggcttgggctgggagtccagcctccgtcagcgaccc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ggctcccggggccgactgcgaggtggcttgggctgggagtccagcctccgtcagcgaccc 756

atgccaagactcaccttccaggcaggggacccctactacatcagcaaacggaaacgggat
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
atgccaagactcaccttccaggcaggggacccctactacatcagcaaacggaaacgggat 816

gagtggctggcacgttggaaaagggaggctgagaagaaagccaaggtaattgcagtaatg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gagtggctggcacgttggaaaagggaggctgagaagaaagccaaggtaattgcagtaatg 876

aatgctgtggaagagaaccaggcctctggagagtctcagaaggtggaggaggccagccct
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aatgctgtggaagagaaccaggcctctggagagtctcagaaggtggaggaggccagccct 936

cctgctgtgcagcagcccacggaccctgcttctccgactgtggccaccacccctgagcca
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cctgctgtgcagcagcccacggaccctgcttctccgactgtggccaccacccctgagcca 996



1 Bolded nucleotides indicate nucleotides that were amended on July 23, 2001. 
gtaggaggggatgctggggacaagaatgctaccaaagcagccgacgatgagcctgagtat
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gtaggaggggatgctggggacaagaatgctaccaaagcagccgacgatgagcctgagtat 1056

gaggatggccggggctttggcattggagagctggtgtgggggaaacttcggggcttctcc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gaggatggccggggctttggcattggagagctggtgtgggggaaacttcggggcttctcc 1116

tggtggccaggccgaattgtgtcttggtggatgacaggccggagccgagcagctgaaggc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tggtggccaggccgaattgtgtcttggtggatgacaggccggagccgagcagctgaaggc 1176

actcgctgggtcatgtggttcggagatggcaagttctcagtggtgtgtgtggagaagctc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
actcgctgggtcatgtggttcggagatggcaagttctcagtggtgtgtgtggagaagctc 1236

atgccgctgagctccttctgcagtgcattccaccaggccacctacaacaagcagcccatg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
atgccgctgagctccttctgcagtgcattccaccaggccacctacaacaagcagcccatg 1296

taccgcaaagccatctacgaagtcctccaggtggccagcagccgtgccgggaagctgttt
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
taccgcaaagccatctacgaagtcctccaggtggccagcagccgtgccgggaagctgttt 1356

ccagcttgccatgacagtgatgaaagtgacagtggcaaggctgtggaagtgcagaacaag
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ccagcttgccatgacagtgatgaaagtgacagtggcaaggctgtggaagtgcagaacaag 1416

cagatgattgaatgggccctcggtggcttccagccctcgggtcctaagggcctggagcca
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cagatgattgaatgggccctcggtggcttccagccctcgggtcctaagggcctggagcca 1476

ccagaagaagagaagaatccttacaaggaagtttacaccgacatgtgggtggagcctgaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ccagaagaagagaagaatccttacaaggaagtttacaccgacatgtgggtggagcctgaa 1536

gcagctgcttacgccccacccccaccagccaagaaacccagaaagagcacaacagagaaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gcagctgcttacgccccacccccaccagccaagaaacccagaaagagcacaacagagaaa 1596

cctaaggtcaaggagatcattgatgagcgcacaagggagcggctggtgtatgaggtgcgc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cctaaggtcaaggagatcattgatgagcgcacaagggagcggctggtgtatgaggtgcgc 1656

cagaagtgcagaaacatcgaggacatttgtatctcatgtgggagcctcaatgtcaccctg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cagaagtgcagaaacatcgaggacatttgtatctcatgtgggagcctcaatgtcaccctg 1716

gagcacccactcttcattggaggcatgtgccagaactgtaagaactgcttcttggagtgt
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gagcacccactcttcattggaggcatgtgccagaactgtaagaactgcttcttggagtgt 1776

gcttaccagtatgacgacgatgggtaccagtcctattgcaccatctgctgtggggggcgt
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gcttaccagtatgacgacgatgggtaccagtcctattgcaccatctgctgtggggggcgt 1836

gaagtgctcatgtgtgggaacaacaactgctgcaggtgcttttgtgtcgagtgtgtggat
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gaagtgctcatgtgtgggaacaacaactgctgcaggtgcttttgtgtcgagtgtgtggat 1896
ctcttggtggggccaggagctgctcaggcagccattaaggaagacccctggaactgctac
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ctcttggtggggccaggagctgctcaggcagccattaaggaagacccctggaactgctac 1956

atgtgcgggcataagggcacctatgggctgctgcgaagacgggaagactggccttctcga
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
atgtgcgggcataagggcacctatgggctgctgcgaagacgggaagactggccttctcga 2016

ctccagatgttctttgccaataaccatgaccaggaatttgaccccccaaaggtttaccca
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ctccagatgttctttgccaataaccatgaccaggaatttgaccccccaaaggtttaccca 2076

cctgtgccagctgagaagaggaagcccatccgcgtgctgtctctctttgatgggattgct
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cctgtgccagctgagaagaggaagcccatccgcgtgctgtctctctttgatgggattgct 2136

acagggctcctggtgctgaaggacctgggcatccaagtggaccgctacattgcctccgag
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
acagggctcctggtgctgaaggacctgggcatccaagtggaccgctacattgcctccgag 2196

gtgtgtgaggactccatcacggtgggcatggtgcggcaccagggaaagatcatgtacgtc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gtgtgtgaggactccatcacggtgggcatggtgcggcaccagggaaagatcatgtacgtc 2256

ggggacgtccgcagcgtcacacagaagcatatccaggagtggggcccattcgacctggtg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ggggacgtccgcagcgtcacacagaagcatatccaggagtggggcccattcgacctggtg 2316

attggaggcagtccctgcaatgacctctccattgtcaaccctgcccgcaagggactttat
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
attggaggcagtccctgcaatgacctctccattgtcaaccctgcccgcaagggactttat 2376

gagggtactggccgcctcttctttgagttctaccgcctcctgcatgatgcgcggcccaag
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gagggtactggccgcctcttctttgagttctaccgcctcctgcatgatgcgcggcccaag 2436

gagggagatgatcgccccttcttctggctctttgagaatgtggtggccatgggcgttagt
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gagggagatgatcgccccttcttctggctctttgagaatgtggtggccatgggcgttagt 2496

gacaagagggacatctcgcgatttcttgagtctaaccccgtgatgattgacgccaaagaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gacaagagggacatctcgcgatttcttgagtctaaccccgtgatgattgacgccaaagaa 2556

gtgtctgctgcacacagggcccgttacttctggggtaaccttcctggcatgaacaggcct
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gtgtctgctgcacacagggcccgttacttctggggtaaccttcctggcatgaacaggcct 2616

ttggcatccactgtgaatgataagctggagctgcaagagtgtctggagcacggcagaata
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ttggcatccactgtgaatgataagctggagctgcaagagtgtctggagcacggcagaata 2676

gccaagttcagcaaagtgaggaccattaccaccaggtcaaactctataaagcagggcaaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gccaagttcagcaaagtgaggaccattaccaccaggtcaaactctataaagcagggcaaa 2736

gaccagcatttccccgtcttcatgaacgagaaggaggacatcctgtggtgcactgaaatg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gaccagcatttccccgtcttcatgaacgagaaggaggacatcctgtggtgcactgaaatg 2796
gaaagggtgtttggcttccccgtccactacacagacgtctccaacatgagccgcttggcg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gaaagggtgtttggcttccccgtccactacacagacgtctccaacatgagccgcttggcg 2856

aggcagagactgctgggccgatcgtggagcgtgccggtcatccgccacctcttcgctccg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aggcagagactgctgggccgatcgtggagcgtgccggtcatccgccacctcttcgctccg 2916

ctgaaggaatattttgcttgtgtgtaagggacatgggggcaaactgaagtagtgatgata
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ctgaaggaatattttgcttgtgtgtaagggacatgggggcaaactgaagtagtgatgata 2976

aaaaagttaaacaaacaaacaaacaccaagaacgagaggacggagaaaagttcagcaccc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aaaaagttaaacaaacaaacaaacaccaagaacgagaggacggagaaaagttcagcaccc 3037

agaagagaaaaaggaatttaaagcaaaccacagaggaggaaaacgccggagggcttggcc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
agaagagaaaaaggaatttaaagcaaaccacagaggaggaaaacgccggagggcttggcc 3098

ttgcaaaagggttggacatcatctcctgagttttcaatgttaaccttcagtcctatctaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ttgcaaaagggttggacatcatctcctgagttttcaatgttaaccttcagtcctatctaa 3158

aaagcaaaataggcccctccccttcttcccctccggtcctaggaggcgaactttttgttt
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aaagcaaaataggcccctccccttcttcccctccggtcctaggaggcgaactttttgttt 3218

tctactctttttcagaggggttttctgtttgtttgggtttttgtttcttgctgtgactga
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tctactctttttcagaggggttttctgtttgtttgggtttttgtttcttgctgtgactga 3278

aacaagagagttattgcagcaaaatcagtaacaacaaaaagtagaaatgccttggagagg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aacaagagagttattgcagcaaaatcagtaacaacaaaaagtagaaatgccttggagagg 3338

aaagggagagagggaaaattctataaaaacttaaaatattggttttttttttttttcctt
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aaagggagagagggaaaattctataaaaacttaaaatattggttttttttttttttcctt 3398

ttctatatatctctttggttgtctctagcctgatcagataggagcacaaacaggaagaga
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ttctatatatctctttggttgtctctagcctgatcagataggagcacaaacaggaagaga 3458

atagagaccctcggaggcagagtctcctctcccaccccccgagcagtctcaacagcacca
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
atagagaccctcggaggcagagtctcctctcccaccccccgagcagtctcaacagcacca 3518

ttcctggtcatgcaaaacagaacccaactagcagcagggcgctgagagaacaccacacca
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ttcctggtcatgcaaaacagaacccaactagcagcagggcgctgagagaacaccacacca 3578

gacactttctacagtatttcaggtgcctaccacacaggaaaccttgaagaaaaccagttt
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gacactttctacagtatttcaggtgcctaccacacaggaaaccttgaagaaaaccagttt 3638

ctagaagccgctgttacctcttgtttacagtt
||||||||||||||||||||||||||||||||
ctagaagccgctgttacctcttgtttacagtt 3670
Exhibit C

Alignment spanning the coding region of mouse Dnmt3b from ATCC Deposit No. 
209934 (top) and currently amended SEQ ID NO:2 (bottom) 2 

caggaaacaatgaagggagacagcagacatctgaatgaagaagagggtgccagcgggtat
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
caggaaacaatgaagggagacagcagacatctgaatgaagaagagggtgccagcgggtat 319

gaggagtgcattatcgttaatgggaacttcagtgaccagtcctcagacacgaaggatgct
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gaggagtgcattatcgttaatgggaacttcagtgaccagtcctcagacacgaaggatgct 379

ccctcacccccagtcttggaggcaatctgcacagagccagtctgcacaccagagaccaga
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ccctcacccccagtcttggaggcaatctgcacagagccagtctgcacaccagagaccaga 439

ggccgcaggtcaagctcccggctgtctaagagggaggtctccagccttctgaattacacg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ggccgcaggtcaagctcccggctgtctaagagggaggtctccagccttctgaattacacg 499

caggacatgacaggagatggagacagagatgatgaagtagatgatgggaatggctctgat
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
caggacatgacaggagatggagacagagatgatgaagtagatgatgggaatggctctgat 559

attctaatgccaaagctcacccgtgagaccaaggacaccaggacgcgctctgaaagcccg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
attctaatgccaaagctcacccgtgagaccaaggacaccaggacgcgctctgaaagcccg 619

gctgtccgaacccgacatagcaatgggacctccagcttggagaggcaaagagcctccccc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gctgtccgaacccgacatagcaatgggacctccagcttggagaggcaaagagcctccccc 679

agaatcacccgaggtcggcagggccgccaccatgtgcaggagtaccctgtggagtttccg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
agaatcacccgaggtcggcagggccgccaccatgtgcaggagtaccctgtggagtttccg 739

gctaccaggtctcggagacgtcgagcatcgtcttcagcaagcacgccatggtcatcccct
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gctaccaggtctcggagacgtcgagcatcgtcttcagcaagcacgccatggtcatcccct 799

gccagcgtcgacttcatggaagaagtgacacctaagagcgtcagtaccccatcagttgac
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gccagcgtcgacttcatggaagaagtgacacctaagagcgtcagtaccccatcagttgac 859

ttgagccaggatggagatcaggagggtatggataccacacaggtggatgcagagagcaga
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ttgagccaggatggagatcaggagggtatggataccacacaggtggatgcagagagcaga 919

gatggagacagcacagagtatcaggatgataaagagtttggaataggtgacctcgtgtgg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gatggagacagcacagagtatcaggatgataaagagtttggaataggtgacctcgtgtgg 979

ggaaagatcaagggcttctcctggtggcctgccatggtggtgtcctggaaagccacctcc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ggaaagatcaagggcttctcctggtggcctgccatggtggtgtcctggaaagccacctcc 1039



2 Bolded nucleotides indicate nucleotides that were amended on July 23, 2001. 
aagcgacaggccatgcccggaatgcgctgggtacagtggtttggtgatggcaagttttct
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aagcgacaggccatgcccggaatgcgctgggtacagtggtttggtgatggcaagttttct 1099

gagatctctgctgacaaactggtggctctggggctgttcagccagcactttaatctggct
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gagatctctgctgacaaactggtggctctggggctgttcagccagcactttaatctggct 1159

accttcaataagctggtttcttataggaaggccatgtaccacactctggagaaagccagg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
accttcaataagctggtttcttataggaaggccatgtaccacactctggagaaagccagg 1219

gttcgagctggcaagaccttctccagcagtcctggagagtcactggaggaccagctgaag
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gttcgagctggcaagaccttctccagcagtcctggagagtcactggaggaccagctgaag 1279

cccatgctggagtgggcccacggtggcttcaagcctactgggatcgagggcctcaaaccc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cccatgctggagtgggcccacggtggcttcaagcctactgggatcgagggcctcaaaccc 1339

aacaagaagcaaccagtggttaataagtcgaaggtgcgtcgttcagacagtaggaactta
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aacaagaagcaaccagtggttaataagtcgaaggtgcgtcgttcagacagtaggaactta 1399

gaacccaggagacgcgagaacaaaagtcgaagacgcacaaccaatgactctgctgcttct
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gaacccaggagacgcgagaacaaaagtcgaagacgcacaaccaatgactctgctgcttct 1459

gagtcccccccacccaagcgcctcaagacaaatagctatggcgggaaggaccgaggggag
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gagtcccccccacccaagcgcctcaagacaaatagctatggcgggaaggaccgaggggag 1519

gatgaggagagccgagaacggatggcttctgaagtcaccaacaacaagggcaatctggaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gatgaggagagccgagaacggatggcttctgaagtcaccaacaacaagggcaatctggaa 1579

gaccgctgtttgtcctgtggaaagaagaaccctgtgtccttccaccccctctttgagggt
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gaccgctgtttgtcctgtggaaagaagaaccctgtgtccttccaccccctctttgagggt 1639

gggctctgtcagagttgccgggatcgcttcctagagctcttctacatgtatgatgaggac
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gggctctgtcagagttgccgggatcgcttcctagagctcttctacatgtatgatgaggac 1699

ggctatcagtcctactgcaccgtgtgctgtgagggccgtgaactgctgctgtgcagtaac
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ggctatcagtcctactgcaccgtgtgctgtgagggccgtgaactgctgctgtgcagtaac 1759
acaagctgctgcagatgcttctgtgtggagtgtctggaggtgctggtgggcg^ 
acaagctgctgcagatgcttctgtgtggagtgtctggaggtgctggtgggcgcaggcaca 1819 
gctgaggatgccaagctgcaggaaccctggagctgctatatgtgcctccctcagcgctgc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gctgaggatgccaagctgcaggaaccctggagctgctatatgtgcctccctcagcgctgc 1879

catggggtcctccgacgcaggaaagattggaacatgcgcctgcaagacttcttcactact
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
catggggtcctccgacgcaggaaagattggaacatgcgcctgcaagacttcttcactact 1939
gatcctgacctggaagaatttgagccacccaagttgtacccagcaattcctgcagccaaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gatcctgacctggaagaatttgagccacccaagttgtacccagcaattcctgcagccaaa 1999

aggaggcccattagagtcctgtctctgtttgatggaattgcaacggggtacttggtgctc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aggaggcccattagagtcctgtctctgtttgatggaattgcaacggggtacttggtgctc 2059

aaggagttgggtattaaagtggaaaagtacattgcctccgaagtctgtgcagagtccatc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aaggagttgggtattaaagtggaaaagtacattgcctccgaagtctgtgcagagtccatc 2119

gctgtgggaactgttaagcatgaaggccagatcaaatatgtcaatgacgtccggaaaatc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gctgtgggaactgttaagcatgaaggccagatcaaatatgtcaatgacgtccggaaaatc 2179

accaagaaaaatattgaagagtggggcccgttcgacttggtgattggtggaagcccatgc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
accaagaaaaatattgaagagtggggcccgttcgacttggtgattggtggaagcccatgc 2239

aatgatctctctaacgtcaatcctgcccgcaaaggtttatatgagggcacaggaaggctc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aatgatctctctaacgtcaatcctgcccgcaaaggtttatatgagggcacaggaaggctc 2299

ttcttcgagttttaccacttgctgaattatacccgccccaaggagggcgacaaccgtcca
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ttcttcgagttttaccacttgctgaattatacccgccccaaggagggcgacaaccgtcca 2359

ttcttctggatgttcgagaatgttgtggccatgaaagtgaatgacaagaaagacatctca
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ttcttctggatgttcgagaatgttgtggccatgaaagtgaatgacaagaaagacatctca 2419

agattcctggcatgtaacccagtgatgatcgatgccatcaaggtgtctgctgctcacagg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
agattcctggcatgtaacccagtgatgatcgatgccatcaaggtgtctgctgctcacagg 2479

gcccggtacttctggggtaacctacccggaatgaacaggcccgtgatggcttcaaagaat
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gcccggtacttctggggtaacctacccggaatgaacaggcccgtgatggcttcaaagaat 2539

gataagctcgagctgcaggactgcctggagttcagtaggacagcaaagttaaagaaagtg
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gataagctcgagctgcaggactgcctggagttcagtaggacagcaaagttaaagaaagtg 2599

cagacaataaccaccaagtcgaactccatcagacagggcaaaaaccagcttttccctgta
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cagacaataaccaccaagtcgaactccatcagacagggcaaaaaccagcttttccctgta 2659

gtcatgaatggcaaggacgacgttttgtggtgcactgagctcgaaaggatcttcggcttc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gtcatgaatggcaaggacgacgttttgtggtgcactgagctcgaaaggatcttcggcttc 2719

cctgctcactacacggacgtgtccaacatgggccgcggcgcccgtcagaagctgctgggc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cctgctcactacacggacgtgtccaacatgggccgcggcgcccgtcagaagctgctgggc 2779

aggtcctggagtgtaccggtcatcagacacctgtttgcccccttgaaggactactttgcc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aggtcctggagtgtaccggtcatcagacacctgtttgcccccttgaaggactactttgcc 2839



Li et al.

Appl. No. 09/720,086 



tgtgaatagttctacccaggactggggagctctcggtcagagccagtgcccagagtc
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tgtgaatagttctacccaggactggggagctctcggtcagagccagtgcccagagtc 2896
LETTER

Estimation of Errors in "Raw" DNA 
Sequences: A Validation Study 

Peter Richterich 1

Genome Therapeutics Corp., Waltham, Massachusetts 02154 USA 



As DNA sequencing is performed more and more in a mass-production-like manner, efficient quality control
measures become increasingly important for process control, but so also does the ability to compare different
methods and projects. One of the fundamental quality measures in sequencing projects is the position-specific
error probability at all bases in each individual sequence. Accurate prediction of base-specific error rates from
"raw" sequence data would allow immediate quality control as well as benchmarking different methods and
projects while avoiding the inefficiencies and time delays associated with resequencing and assessments after
"finishing" a sequence. The program PHRED provides base-specific quality scores that are logarithmically
related to error probabilities. This study assessed the accuracy of PHRED's error-rate prediction by analyzing
sequencing projects from six different large-scale sequencing laboratories. All projects used four-color
fluorescent sequencing, but the sequencing methods used varied widely between the different projects. The
results indicate that the error-rate predictions such as those given by PHRED can be highly accurate for a large
variety of different sequencing methods as well as over a wide range of sequence quality.



In DNA sequencing, knowledge about the accuracy of sequences can be very valuable. For example,
different large-scale sequencing projects may produce sequences at similar rates and costs but with
significantly different error rates in the final sequence. One major determinant in the final error rate
is the accuracy of the "raw" sequence. Knowledge about the frequency and location of errors in the raw
sequence data can help to direct "polishing" efforts to the places where additional effort is needed; it
also enables the comparison between different sequencing projects without requiring that the same region
be sequenced in each project.

Another area where estimates about sequence error rates would be beneficial is technology development.
Accurate error estimates at each base would enable "quality benchmarking" between different methods,
thus enabling researchers to choose the method that fills their needs for accuracy and throughput best.

Several groups have developed mathematical 
models to predict the error probability at any given 
position in raw sequences. Lawrence and Solovyev 
used linear discriminant analysis to calculate separate
probability estimates for insertions, deletions,
and mismatches (Lawrence and Solovyev 1994). Ewing
and Green (1998) developed the program



1 E-MAIL peter.richterich@genomecorp.com; FAX (781) 895-9535.



PHRED, which calculates a quality score at each
base. This quality score q is logarithmically linked to
the error probability p: $q = -10 \times \log_{10}(p)$ (for a
discussion of how quality scores are calculated and
what the limitations are, see Ewing et al. 1998).
When used in combination with sequence assembly
and finishing programs that utilize these error estimates,
reliable error probabilities promise to increase
the accuracy of consensus sequences and to
reduce the efforts required in the finishing phase of
sequencing projects (Churchill and Waterman
1992; Bonfield and Staden 1995).
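
The relation between quality score and error probability can be sketched in a few lines of Python. This is a minimal illustration of the formula q = -10 x log10(p) only; the function names are chosen here for illustration and are not part of PHRED.

```python
import math


def quality_to_error_probability(q: float) -> float:
    """Return the error probability p implied by quality score q (q = -10 * log10(p))."""
    return 10 ** (-q / 10)


def error_probability_to_quality(p: float) -> float:
    """Return the quality score q implied by error probability p, with 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error probability for a few quality scores mentioned in the text.
for q in (4, 10, 20, 30, 40):
    print(q, f"{quality_to_error_probability(q):.4%}")
```

For example, a quality score of 30 corresponds to an error probability of 0.1%, and a score of 4 to roughly 39.8%.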

To examine the accuracy of probability estimates
made by the program PHRED, we compared
the actual and predicted error rates for six different
cosmid- or BAC-sized projects that were produced
by six different large-scale sequencing centers in the
United States. All six projects used four-color
fluorescent sequencing machines; however,
the DNA preparation methods, sequencing enzymes,
fluorescent dyes and chemistries, and gel
lengths varied significantly between the six groups.
Table 1 gives an overview of the sequencing projects
analyzed. Table 2 lists the different methods used.

RESULTS 

Error Rate Prediction Accuracy for Six Projects 

A comparison of actual and predicted error rates for 
the six projects in this study is shown in Table 3. 
Table 1. Summary of Data Sets

Project   Reads   Aligned bases   Average aligned read length
A           455         416,214   915
B          1277         871,230   682
C          1065         603,655   567
D           834         414,595   497
E          1638       1,149,209   702
F          1885         907,796   482
Total      7154       4,362,699   610



The results indicate that PHRED is very successful in
identifying bases with low error probabilities. For
example, the 1.28 million bases with quality scores 
of 4-12 (corresponding to error probabilities be- 
tween 39.8% and 6.3%) contain a total of 187,926 
errors. In contrast, the 1.44 million bases with qual- 
ity scores between 33 and 42 (corresponding to error 
probabilities between 0.05% and 0.006%) contain 
only 237 errors, which translates into a 790-fold 
lower error rate. The trend toward lower error rates 
can also be observed for each individual project. In 
most cases, the actual number of errors is close to
the predicted number. It is also apparent that the
actual error rate is typically lower than the predicted
error rate.

Both the high overall accuracy and the tendency
to slightly overpredict errors are confirmed
by statistical analysis, as shown in Table 4. The correlation
between predicted and actual error frequencies
is excellent for all projects (Spearman correlation
coefficient >0.89, P < 0.0001). Averaged over all
projects, the actual error rate is 84.5% of the predicted
error rate; the slope of the relation between
predicted and actual error rates differs slightly between
projects and ranges from 76.6% to 88.4%. To
put these differences between projects in perspective, it
is worthwhile remembering that PHRED quality
scores cover a wide dynamic range: the maximum
quality score of 51 corresponds to a 50,000-fold
lower predicted error rate than the minimum quality
score of 4. Even the relative difference between
successive quality scores is larger than the relative difference
in the slopes; for example, a quality score of 10
corresponds to an error probability of 10%, whereas
a score of 9 corresponds to an error probability of
12.6%.
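
The arithmetic behind this comparison can be checked with a short sketch. The helper name `fold_difference` is introduced here for illustration only; it simply evaluates the ratio of the error probabilities implied by two quality scores.

```python
def fold_difference(q_low: float, q_high: float) -> float:
    """Ratio of predicted error probabilities p(q_low) / p(q_high)."""
    return 10 ** ((q_high - q_low) / 10)


# Dynamic range between the minimum (4) and maximum (51) quality scores.
dynamic_range = fold_difference(4, 51)

# Relative step between two successive quality scores (e.g., 9 and 10).
successive_step = fold_difference(9, 10)

# Relative spread of the slopes reported for the six projects (76.6% to 88.4%).
slope_spread = 0.884 / 0.766

print(round(dynamic_range), round(successive_step, 3), round(slope_spread, 3))
```

The step between successive scores (a factor of about 1.26) exceeds the spread between the smallest and largest slopes (a factor of about 1.15), which is the point made in the text.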

A different way of looking at the relation be- 
tween the actual and predicted error rates is shown 



in Figure 1. Here, the error rates as a function of the
position within all reads in each of the projects, averaged
over 50-base windows, are depicted. For all six
projects, the predicted error rates are very close to
the actual error rates over the entire length of the
sequences. Each project has a characteristic distribution
of error rates, which differs from each of the
other projects. The minimum error rate differs dramatically
between projects. The best projects
achieve raw error rates of 0.23%-0.36% in the best
region of the sequence read, typically from base 150
to 200. The worst project in the data set had an
~10-fold higher error rate of 2.58%.

Toward the end of sequence reads, the error 
rates increase and start to exceed 10% between bases 
300 and 700. In projects that used mainly short gels 
(e.g., projects D and F), this increase begins sooner,
whereas projects that use longer gels show a mark- 
edly longer stretch of low error rates (e.g., projects A 
and B). 

Table 5 summarizes key results for the six 
projects. The first four projects have similar mini- 
mum and average error rates. However, the length 
of the region where the error rate is below 5% differs 
significantly, from 403 to 682 bases. The projects
with the shorter low error rate regions contained
larger portions of reads generated on short gels,
whereas projects A and B were run exclusively on
long gels (ABI 373 stretch or ABI 377 sequencers).
Other factors contributing to differences between 
the first four projects were differences in sequencing 
chemistries, production scale, and electrophoresis 
conditions and machines. 

Project E and, in particular, project F had significantly
higher error rates than the first four
projects. In projects E and F, every sequence generated
for the project had been included in the data
set, whereas the other four projects had eliminated 
some "bad" sequences through manual or auto- 



Table 2. Overview of Sequencing Methods Used in the Different Projects

Template DNA             single-stranded M13, double-stranded plasmids
Sequencing enzymes       Sequenase, Taq, KlenTaqTR, AmpliTaq FS
Sequencing chemistries   Dye primer (two different dye chemistries), dye terminator
Sequencing machines      ABI 373, ABI 373 stretch, ABI 377
Gel length               Only short gels, only long gels, mixes of short and long gels
Table 3. Comparison of Predicted and Actual Error Rates for Six Different
Sequencing Projects

Project  Quality score         4-12      13-22      23-32      33-42      43-51

A        aligned bases      119,246     75,293     70,391    144,876     73,234
         expected errors     20,256      2,064        172         37          1
         actual errors       16,784      1,758        127         17          1

B        aligned bases      182,034    137,940    181,998    399,690    140,176
         expected errors     29,953      3,704        410        102          3
         actual errors       26,038      2,536        287         35          0

C        aligned bases      139,345    131,419    151,197    292,070     68,529
         expected errors     22,277      3,411        357         74          2
         actual errors       16,670      1,513        194         26          3

D        aligned bases      103,898     68,995     68,613    153,730    111,752
         expected errors     16,880      1,919        168         38          3
         actual errors       14,495      1,924        146         59          2

E        aligned bases      378,755    217,438    167,968    392,717    144,313
         expected errors     63,947      6,336        418         95          4
         actual errors       55,968      6,516        355         67          5

F        aligned bases      359,809    136,688     98,840     64,035      5,130
         expected errors     66,938      4,079        256         23          0
         actual errors       57,971      3,856        332         33          1

All      aligned bases    1,283,087    767,773    739,007  1,447,118    543,134
         expected errors    220,252     21,513      1,781        370         13
         actual errors      187,926     18,103      1,441        237         12



matic inspection. After eliminating <10% of the
worst sequences in project E, the error rates for the
remaining sequences were comparable to those of
the first four projects. In contrast, project F showed
a much more uniform distribution of sequence
quality.

The last column in Table 5 shows the average
number of bases with an estimated error probability
of at most 0.1%, which is equivalent to a quality
score of at least 30. The count of such "very high-quality"
bases is a good indicator of sequence quality,
both for individual sequences and, when aver-



Table 4. Summary of Statistical Analysis Results

Project   Spearman ρ   P>|ρ|     Slope   t ratio   P>|t|
A         0.9646       <0.0001   0.818    75.1     <0.0001
B         0.9890       <0.0001   0.874    98.2     <0.0001
C         0.9846       <0.0001   0.766    71.6     <0.0001
D^a       0.8692       <0.0001   0.855    68.3     <0.0001
E         0.9956       <0.0001   0.884   144.3     <0.0001
F         0.9968       <0.0001   0.865   151.6     <0.0001
All       0.9964       <0.0001   0.845   174.5     <0.0001


^a In project D, the Spearman correlation coefficient ρ was artificially low as only very few (10) bases had 
a quality score of 5, and none of these bases contained an actual error (expected: 3.16 errors). Exclusion of 
this quality score gave a Spearman correlation coefficient of 0.9786 (P < 0.0001). The frequencies in the slope 
calculations were weighted by the number of bases at any given quality score and, thus, were not sensitive to 
such small sample distortions (see Methods). 
Figure 1. Actual and predicted error rates in six different sequencing projects. Actual error rates and predicted 
error rates in 50-base windows over the length of the sequence reads, averaged over all reads that could be aligned 
to the consensus sequence by CROSS_MATCH, are shown. The numbers on the x-axis show the first base in a given 
50-base window. 



aged over all sequences in a project, as an indicator 
for the entire project. Compared to the estimated 
error rates, the count of very high-quality bases is 
less prone to distortions from a small number of 
low-quality reads, as the data for project E demonstrate.



Prediction Accuracy for Data Subsets of Different 
Quality 

The quality of sequences within any given project 
can vary substantially, and the use of predicted error 
rates has the potential to be a powerful tool for qual- 



Table 5. Comparison of Key Results for Six Different Sequencing Projects

           Actual minimum   Actual average   Length of <1%   Length of <5%   Average bases with
Project    error rate (%)   error rate (%)   error region    error region    P(error) <0.1%
A          0.36             3.6              422             682             468
B          0.34             2.8              274             567             395
C          0.23             2.4              291             479             348
D          0.39             3.1              300             403             294
E          0.71             4.7              129             464             317
F          2.58             9.2              0               162             79

ity analysis and control in large-scale DNA sequencing
projects. To analyze how accurate PHRED error
estimates are for different quality sequences within 
the same sequencing project, we subdivided a data 
set into four quartiles, based on the number of very 
high-quality bases in each sequence (see Methods). 
The comparison of actual and predicted error rates is 
shown in Figure 2. 

When measured by the error rate in the best 
region of a sequence, the data quality in the differ- 
ent quartiles varies >100-fold between the best and 
the worst 25% of the sequences. The best quartile 
showed >-0.03% error for > 100 bases, whereas the 
error rate in the worst quartile always exceeded 5%. 
In quartiles 2 and 3, the predicted error rates match 
the actual error rates very closely. In the best and 
Figure 2. Actual and predicted error rates in different quality subsets of project 
B. Sequence reads were sorted by the number of bases with a predicted error rate 
of at most 0.1% (very high-quality bases), and assigned to quartiles, with quartile 
1 corresponding to the highest numbers. Actual and predicted error rates for all 
sequences in each subset were calculated as in Fig. 1. Note that a number of 
sequence reads that had been rejected because of too low quality were added 
back to the data set for illustrative purposes, all of which are in quartile 4. These 
sequences were not included in the data sets used to generate Figs. 1 and 3 and 
Tables 1 and 3. 
worst quartiles, PHRED's accuracy was somewhat
lower from base 100 to 500. In the best sequences,
PHRED's error estimates were about twofold too
high; in the worst sequences, the error estimates
were too low, again by a factor of 2. This underprediction
of errors can be partially explained by the
fact that PHRED gives ambiguous base calls (N's) a
quality score of 4, corresponding to an error probability
of 39.8%; however, N's will always show up
as an actual error. Even in the worst and best quartiles,
however, the predicted error rate curves are
very similar to the actual error rate curves.

The results shown in Figure 2 also demonstrate
that the count of very high-quality bases, or bases
with an estimated error probability of at most 0.1%,
can be used effectively to characterize the overall
quality of a sequence read. Sorting the sequence reads
into quartiles based on the number of very high-quality
bases worked well, as shown by the >100-fold difference in
the minimum error rate between the first and the fourth
quartile.
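
As an illustration of read-level quality summaries, the following sketch computes the count of very high-quality bases (quality score of at least 30) together with the predicted number of errors and the mean predicted error rate of a read. The two reads are invented for illustration: they share an identical good region, and the second carries a longer low-quality tail. The function names are assumptions made here, not part of PHRED.

```python
def error_probability(q: float) -> float:
    """Error probability implied by quality score q."""
    return 10 ** (-q / 10)


def summarize_read(qualities, vhq_threshold=30):
    """Summarize one read given its per-base quality scores."""
    if not qualities:
        raise ValueError("read has no bases")
    probabilities = [error_probability(q) for q in qualities]
    return {
        "vhq_bases": sum(1 for q in qualities if q >= vhq_threshold),
        "predicted_errors": sum(probabilities),
        "mean_error_rate": sum(probabilities) / len(probabilities),
    }


# Hypothetical reads: same 400-base good region, different low-quality tails.
good_region = [40] * 400
read_a = good_region + [6] * 50
read_b = good_region + [6] * 400

summary_a = summarize_read(read_a)
summary_b = summarize_read(read_b)
print(summary_a)
print(summary_b)
```

In this constructed case the mean predicted error rate of the second read is several times higher, whereas the count of very high-quality bases is the same for both reads.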

Other methods to characterize the overall quality of individual
reads based on PHRED quality scores can give similar results.
For example, counting bases above a minimum quality threshold
anywhere in the range of 20-40 gave similar results for most
data sets (not shown), and such counts are used by a number of
different laboratories as quality measures. Alternatively, the
quality values can be converted to error probabilities and
averaged to give the predicted error rate for the trace, or
summed to give the total predicted number of errors in a trace.
However, such averages and totals can sometimes give a misleading
picture, as the following example illustrates. Assume that two
sequence reads have very similar quality in the alignable part
of the read but that one of the two sequences was run much
longer and therefore contains a longer unalignable "tail" of
very low-quality bases. When calculating the average error rate
for these two sequences, the second sequence will have a much
higher average error and, therefore, appear to be of lower
quality. In contrast, the counts of very high-quality bases for
both sequences will be very similar, as the unalignable tails
contain few, if any, high-quality bases. Therefore, counts of
bases above a high enough quality threshold will give a more
robust and clearer picture of trace quality.

Frameshift Error Rates for Different Sequencing
Chemistries

Depending on how biologists use DNA sequences, knowledge about
total error rates in raw sequences may or may not be sufficient.
For example, frameshift errors in coding sequences will
generally lead to incorrectly predicted open reading frames,
whereas mismatch errors will do so only if the mismatch
introduces a stop codon or a new splice site. At the time of
this writing, PHRED did not differentiate between mismatch and
frameshift errors, but only estimated total error rates. This
might occasionally lead to questionable conclusions, as the
results shown in Figure 3 illustrate.

Figure 3 shows the total actual error rates and the frameshift
error rates for two projects, A and B. The total error rates for
both projects are similar for up to 350 bases; after 350 bases,
project B has a somewhat higher total error rate. However,
examining the frameshift error rate gives rise to a different
picture: from base 1 to 500, project A has approximately four
times as many insertions and deletions as project B. This
difference in frameshift error rates can be explained by the
sequencing chemistries that were used in the two projects.
Project B, with the lower frameshift error rate, used only dye
terminator chemistry, which is known to eliminate band spacing
artifacts from hairpin structures ("compressions"). Project A,
on the other hand, used dye primer chemistry, which is more
prone to insertion and deletion errors from mobility artifacts,
for most sequencing reactions.

Figure 3. Actual frameshift and total error rates for projects A and B. To calculate
frameshift error rates, only insertions and deletions were counted. Mismatch
errors, which account for the vast majority of errors after base 150, were included
only in the total error count. Note that project B (A,A) has a similar or
slightly higher total error rate compared to project A (#;0) but only about
one-third as many insertions and deletions up to base 500. For both projects, the
frameshift error rate in the raw data is <1 in 1000 for >300 bases, and ≤1 in
10,000 for >100 bases in project B.

DISCUSSION 

As large-scale DNA sequencing has become a more 
routine and common process, the traditional methods
for assessing sequence quality have become unsatisfactory.
In projects like single-pass cDNA sequencing,
it is not possible to calculate and compare
error rates after finishing a sequence, as finishing 
never takes place. Even when a comparison between 
raw and finished sequence can be done, the time 
delay between raw data generation and quality as- 
sessment is often large. This delay makes it difficult 
to improve ongoing projects, and it sometimes 
makes it impossible to capture problems early on. 
Some immediate quality feedback can be reached by 
including known standard sequences for quality 
control. However, this approach can be costly, and 
it fails when error profiles differ between standard 
and unknown sequences. 

In contrast to these traditional methods to assess
sequence accuracy, direct estimation of error
rates in raw sequence data would enable immediate
quality control and feedback. Accurate, base-by-base
estimates of error probabilities could also increase
the utility of single-pass sequences significantly,
allow efficient comparison and optimization
of different sequence chemistries, and enable the
development of better software tools for sequence
assembly and analysis.

The critical question for any error rate prediction
tool is how accurate are the error rate estimates,
in particular if different sequencing methods and
chemistries are used? The results presented herein
provide an answer to this question for the program
PHRED, as well as clues where further development
would be useful. As shown in Tables 3 and 4 and in
Figure 1, the agreement between predicted and actual
error rates was very good in each of the six
different projects analyzed. The observed high level
of prediction accuracy in all of these projects is almost
astonishing if one takes into account that actual
errors are binary (a base is either correct or
wrong), whereas predicted error rates are probabilities
on a scale from 0.0 to 1.0. The observed tendency
to overpredict error rates can be at least partially
explained by the "small sample correction"
that was used in the derivation of threshold parameters
for quality scores (Ewing and Green 1998). For
most practical applications, such a somewhat conservative
estimation of quality scores is tolerable or
even desirable. Overall, the results clearly show that
error probabilities given by PHRED accurately describe
raw sequence data quality.

In judging the usefulness of predicted error
probabilities, it is important to know how differences
in sequencing methods will influence the prediction
accuracy. For example, the variation
in peak heights tends to be larger in dye terminator
sequencing than in dye primer sequencing, and different
sequencing enzymes are known to produce
different specific height variation patterns. Any estimation
of error probabilities that takes the peculiarities
of a specific sequencing chemistry into account
would therefore be expected to be less accurate
for different chemistries.

The projects included in this study were specifically
chosen to provide an initial answer to the
question of how generally useful PHRED quality
scores are. These projects represent the vast majority
of different multicolor fluorescent sequencing
methods used in the last 3 years: different template
DNAs and DNA preparation methods, different enzymes,
gel lengths, run conditions, and different
fluorescent dyes. The data also include a considerable
spread in data quality, both between projects
and within individual projects. None of the projects
analyzed here were included in PHRED's training
set, and just one of the six laboratories that contributed
data to this study also contributed data to the
training data sets. One of the projects in this study
consisted entirely of dye terminator sequences,
which represented only a small fraction of the sequences
in the test data set. Another project exclusively
used a set of fluorescent dyes different from
those used in the training sets. Each project differed
from the other projects in this study in at least one,
and typically many, experimental aspects like template
preparation, sequencing enzymes, gel run conditions,
and so forth. Despite these differences, the
accuracy of error rate predictions was very similar
for all projects.

Our results justify some optimism about the accuracy
of PHRED quality scores for minor changes
in sequencing technology, for example, sequences
generated by new enzymes and fluorescent dyes.
Initial studies showed that PHRED quality scores
were also accurate for sequences produced by multiplex
sequencing with radioactive detection (P.
Richterich, unpubl.). However, we also observed
two effects that can invalidate PHRED quality scores
during these studies. First, sequences generated by
chemical sequencing gave too low quality scores at
mixed (A + G) reactions. Because secondary peak
height is one of the parameters used in the error rate
predictions, this is not surprising. Another potential
source of error is high-frequency noise in the trace
data. With such data, PHRED occasionally underestimated
the band spacing by a factor of 2 or more,
which resulted in incorrect base calls and quality
scores. By applying simple smoothing algorithms to
data with high-frequency noise, these problems
could typically be resolved. Similar steps may be
necessary to obtain accurate PHRED quality scores
on data that have been generated by different sequencing
instruments or preprocessed by different
software.

Accurate quality scores can have a major impact
on how sequences are used downstream from the
sequence production process. In traditional sequencing
projects where the goal is complete coverage
at a final error rate below (e.g.) 1 in 10,000, the
accuracy goals can be reached with single sequence
reads as long as the quality scores are at least 40
(however, other potential problems like clone instability
may make higher coverage advisable). Interesting
questions arise as to how individual read
quality contributes to project quality, or the error
rate of the "final" sequence. Under the assumption
that errors between different sequence reads are
completely independent, one could argue that two
reads with a quality score of 20 (error probability of
1 in 100) are just as valuable as one sequence with a
quality score of 40 (error probability of 1 in 10,000).
However, although a single sequence stretch with
quality levels above 40 would give a final sequence
with an error rate of <1 in 10,000, assembling a consensus
from two sequences with quality scores of 20
(1% error rate) could lead to one of two results: If
the errors were completely random, the consensus
sequence would be ambiguous at 2% of all locations;
if the errors were completely localized, for example,
because of reproducible compressions, the
consensus sequence would have one "hidden" error
every 100 bases. Typically, consensus sequences derived
from low-quality sequences will have both
kinds of problematic regions. Increased coverage
can rapidly eliminate the random errors; however,
increased coverage does not resolve errors from systematic
sources. Manual examination of such problem
areas is generally required; such "contig editing,"
however, tends to be time consuming, requires
highly trained personnel, is an obstacle
toward complete automation of DNA sequencing,
and sometimes fails to eliminate all errors. This
leads to the somewhat counterintuitive conclusion
that the practical value of increasing sequence quality
can be even higher than indicated by the quality
scores: One sequence of average quality above 40
can be "worth" more than two sequences of average
quality 20.
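
The two-read argument can be made concrete with a small calculation. The sketch below assumes a deliberately simplified model in which each base of each read is wrong independently with the probability implied by its quality score; the function name and the model are illustrative assumptions, not a description of how an assembler builds a consensus.

```python
def error_probability(q: float) -> float:
    """Error probability implied by quality score q."""
    return 10 ** (-q / 10)


def two_read_outcomes(q: float):
    """Per-position outcome probabilities for two independent reads of quality q.

    Returns (both_correct, exactly_one_wrong, both_wrong).
    Positions where exactly one read is wrong are positions where the two
    reads disagree, so a two-read consensus is ambiguous there.
    """
    p = error_probability(q)
    both_correct = (1 - p) ** 2
    exactly_one_wrong = 2 * p * (1 - p)
    both_wrong = p ** 2
    return both_correct, exactly_one_wrong, both_wrong


both_correct, ambiguous, both_wrong = two_read_outcomes(20)
single_q40_error = error_probability(40)
print(f"ambiguous positions with two Q20 reads: {ambiguous:.4f}")
print(f"positions where both Q20 reads are wrong: {both_wrong:.6f}")
print(f"error probability of a single Q40 read: {single_q40_error:.6f}")
```

Under these assumptions, roughly 2% of positions are ambiguous with two reads of quality 20, in line with the estimate given in the text, whereas a single read of quality 40 has no such ambiguity and a predicted error rate of 1 in 10,000.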

Another application of DNA sequencing where
high quality can be of disproportionately high value
is the search for mutations in genomic DNA. In low-quality
sequences, secondary peaks and low resolution
often complicate the identification of heterozygous
mutations. In regions of higher sequence
quality, such secondary peaks are smaller or absent
and peaks are better resolved. Therefore, both false-positive
and false-negative errors can be significantly
reduced in high-quality regions. Tools like
PHRED, which can accurately measure sequence
quality from trace data, can be of twofold value for
mutation detection. First, base-specific quality
scores can allow optimization of sequencing methods
and strategies for mutation detection. Second,
the quality scores can be used to evaluate the usefulness
of individual sequence reads for mutation
detection (e.g., by discarding reads below minimum
thresholds), and they can guide software that automatically
detects mutations.

The ability to predict error rates in a highly accurate
fashion is likely to have a major impact in
applications like those described above. PHRED is
the first widely used program that accurately predicts
base-specific error probabilities. However, the
algorithm for determining quality values has been
described (Ewing and Green 1998), and it should be
straightforward to implement similar quality values
in other base-calling programs. Furthermore, an extension
of the approach developed by Ewing and
Green should be possible. For example, differentiation
between mismatch and frameshift errors would
enable better comparisons of sequencing methods
with similar total error rates but different frameshift
error rates. Several groups have described efforts to
calculate separate probabilities (or "confidence assessments")
for mismatch errors and frameshift errors
(Lawrence and Solovyev 1994; Berno 1996).
Their results demonstrated that different approaches
to error type characterization are feasible
and promising. Implementation of such error type
predictions in other programs similar to the way
PHRED uses quality scores would enable better
method assessments, benchmarking, and production
quality control, and could have a significant impact
on downstream uses of DNA sequence information.

METHODS 

Data Sets 

For one project, sequence raw data in the form of
ABI trace files were downloaded from a public FTP
site. Sequence data for the five other projects were
kindly provided by five different large-scale sequencing
groups. Table 1 gives a summary of the six
projects, and Table 2 gives an overview of the different
sequencing methods used in the projects. The
projects differed in the amount of prescreening of
data that had been done, reflecting different approaches
to quality control in different laboratories.
In two projects (B and C), different software programs
had been used to identify and eliminate low-quality
sequences. One project (F) included all data
files generated, whereas the other three projects had
excluded "failed lanes."

Comparison of Actual and Predicted Error Rates

The sequences for all traces in each project were
re-called using the program PHRED (v. 961028).
Next, sequences in each project were assembled
with PHRAP (P. Green, unpubl.). Slightly different
methods were chosen for the statistical and graphical
evaluation of the error rate prediction accuracy.
In the statistical evaluation, only the longest contig
produced by PHRAP was considered. The tables of 
aligned bases and observed discrepancy counts for 



258 GENOME RESEARCH



ESTIMATION OF ERRORS IN RAW DNA SEQUENCES 



each quality score were taken from the PHRAP output
and analyzed as follows. The expected number
of discrepancies (E) at each quality score (q) was calculated
by multiplying the number of aligned bases
(N) with the error probability corresponding to the
quality score: $E = N \times 10^{-q/10}$. The Spearman rank
correlation coefficients were calculated by comparing the
expected and observed error frequencies. To obtain
the quantitative relation between the expected and
observed error rates over the entire range, a least-squares
fit between the observed and expected rates
was performed, with the intercept set to zero and
the number of aligned bases at each quality score
used as weights.
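
The statistical evaluation described above can be sketched as follows. This is a minimal illustration written with the Python standard library, not the software used in the study; the per-score counts in `table` are invented, and all function names are assumptions. The weighted zero-intercept slope is computed as sum(w*x*y) / sum(w*x*x), with the expected rate as x, the observed rate as y, and the number of aligned bases as the weight w.

```python
def expected_errors(n_bases: float, q: float) -> float:
    """Expected number of discrepancies: E = N * 10^(-q/10)."""
    return n_bases * 10 ** (-q / 10)


def weighted_zero_intercept_slope(x, y, w):
    """Slope b minimizing sum(w * (y - b*x)^2), i.e. a weighted fit through the origin."""
    numerator = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    denominator = sum(wi * xi * xi for xi, wi in zip(x, w))
    return numerator / denominator


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


# Hypothetical input: quality score -> (aligned bases N, observed discrepancies).
table = {10: (1000, 80), 20: (10000, 80), 30: (100000, 80)}
scores = sorted(table)
weights = [table[q][0] for q in scores]
expected_rate = [10 ** (-q / 10) for q in scores]
observed_rate = [table[q][1] / table[q][0] for q in scores]

slope = weighted_zero_intercept_slope(expected_rate, observed_rate, weights)
rho = spearman(expected_rate, observed_rate)
print(f"slope = {slope:.3f}, Spearman rho = {rho:.3f}")
```

In this invented data set every observed rate is 80% of the expected rate, so the fitted slope is 0.8. The same `expected_errors` helper gives about 3.16 expected errors for 10 bases at a quality score of 5.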

For a graphical comparison of estimated and actual
error rates in 50-bp windows, the following
steps were taken. For two of the projects, the consensus
sequence was retrieved from public databases.
For the four other projects, the DNA sequence
and quality information were used by the program
PHRAP to assemble consensus sequences for each of
the projects. The individual reads were aligned to
the consensus sequences of the longest contig, using
the program CROSS_MATCH (P. Green, unpubl.),
after removing single-coverage regions from
the ends of the consensus sequence. CROSS_MATCH
uses an implementation of the Smith-Waterman
algorithm to generate alignments that
typically do not include the ends of sequences,
where disagreements are commonly due to vector
sequence or low quality sequence.

The quality files generated by PHRED and the
alignment summaries generated by CROSS_MATCH
were then analyzed as follows. First, the
region of each query sequence that had been aligned
by CROSS_MATCH was determined. Next, the actual
and predicted error rates for the entire aligned part of
each individual sequence were calculated. In addition,
the average actual and predicted error rates for
all alignable sequences together were calculated for
windows of 50 bases in length. To calculate the predicted
error rate, the quality scores q determined by
PHRED at each base were converted to error probabilities
as described above (Ewing and Green 1998).
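
The window-based comparison can be sketched as below. The input layout is an assumption made for this illustration: each read is a list of (position, quality, is_error) tuples for its aligned bases, with 1-based positions within the read. It does not reproduce the CROSS_MATCH output format.

```python
from collections import defaultdict


def windowed_error_rates(reads, window=50):
    """Average actual and predicted error rates per window of read positions.

    reads: iterable of reads; each read is a list of
           (position, quality, is_error) tuples for its aligned bases.
    Returns {first_base_of_window: (actual_rate, predicted_rate)}.
    """
    bases = defaultdict(int)
    actual = defaultdict(int)
    predicted = defaultdict(float)
    for read in reads:
        for position, quality, is_error in read:
            start = (position - 1) // window * window + 1
            bases[start] += 1
            actual[start] += 1 if is_error else 0
            predicted[start] += 10 ** (-quality / 10)
    return {
        start: (actual[start] / bases[start], predicted[start] / bases[start])
        for start in sorted(bases)
    }


# Hypothetical read: 100 aligned bases of quality 20 with one error at base 10.
example_read = [(p, 20, p == 10) for p in range(1, 101)]
rates = windowed_error_rates([example_read])
for start, (actual_rate, predicted_rate) in rates.items():
    print(start, f"{actual_rate:.3f}", f"{predicted_rate:.3f}")
```

As in Figure 1, each window is labeled by its first base, so this example yields windows starting at bases 1 and 51.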

Subdividing Data Into Subsets Based on Data Quality 

To examine the accuracy of PHRED quality scores
for data subsets of different quality within a project,
the following approach was taken. For all sequence
reads in project B, the number of bases with a quality
score of at least 30 in each sequence was determined
(bases with quality scores of at least 30 were
called very high-quality bases, or VHQ bases). Sequences
were sorted in descending order based on
the number of very high-quality bases, and divided
into four quartiles. Accordingly, quartile 1 contained
25% of sequences with the highest number
of very high-quality bases, and quartile 4 contained
the "worst" sequences. To illustrate the prediction
accuracy in data with relatively high error rates, sequences
from project B that had been "discarded"
because they had not met the minimum quality criteria
were added back to the data set. The sequences
in each quartile were compared to the consensus sequences
that had been generated using the entire data
set, as described above for the graphical comparison.
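
The sorting and subdivision step can be sketched as follows. The read names and quality values are invented, and the handling of ties and of read counts that are not divisible by four is an assumption of this sketch, as the text does not specify it.

```python
def vhq_count(qualities, threshold=30):
    """Number of very high-quality (VHQ) bases: quality score >= threshold."""
    return sum(1 for q in qualities if q >= threshold)


def split_into_quartiles(reads, threshold=30):
    """Sort reads by descending VHQ count and split them into four quartiles.

    reads: dict mapping read name -> list of per-base quality scores.
    Returns a list of four lists of read names, quartile 1 (best) first.
    """
    ordered = sorted(
        reads, key=lambda name: vhq_count(reads[name], threshold), reverse=True
    )
    n = len(ordered)
    bounds = [n * k // 4 for k in range(5)]
    return [ordered[bounds[k]:bounds[k + 1]] for k in range(4)]


# Hypothetical reads with 10, 5, 0, and 8 VHQ bases, respectively.
example_reads = {
    "r1": [40] * 10,
    "r2": [40] * 5 + [10] * 5,
    "r3": [10] * 10,
    "r4": [40] * 8 + [10] * 2,
}
quartiles = split_into_quartiles(example_reads)
print(quartiles)
```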

Determining Actual Frameshift Error Rates 

The calculation of actual frameshift error rates in
the raw sequence data was performed using CROSS_MATCH,
similar to the procedure described above
for total error rates, except that only insertion and
deletion errors were counted. Because PHRED does
not give separate frameshift error estimates, a comparison
of predicted and actual frameshift errors is
not possible.